DOCUMENT RESUME

ED 190 605                                              TM 800 397

AUTHOR        Kolen, Michael J.
TITLE         Comparison of Traditional and Latent Trait Theory
              Methods for Equating Tests.
PUB DATE      Apr 80
NOTE          2**p.; Paper presented at the Annual Meeting of the
              American Educational Research Association (64th,
              Boston, MA, April 7-11, 1980).

EDRS PRICE    MF01/PC01 Plus Postage.
DESCRIPTORS   *Achievement Tests; *Difficulty Level; *Equated
              Scores; Guessing (Tests); High Schools; Latent Trait
              Theory; Quantitative Tests; *Test Items; True Scores;
              Vocabulary Skills
IDENTIFIERS   Equipercentile Equating; *Iowa Tests of Educational
              Development; Linear Equating Method



ABSTRACT 

Results from equipercentile, linear, and latent trait
equating of the vocabulary and quantitative thinking tests of the
Iowa Tests of Educational Development were compared. The study
entailed both the equating of forms (of similar difficulty) and the
equating of levels (of differing difficulty). The goal was to equate
seventh edition tests to those of the sixth edition. The data were
item responses from a representative sample of 10,728 Iowa high
school students. One-, two-, and three-parameter logistic latent
trait methods were used. The results from the equating methods were
compared using a cross-validation criterion which measured the
closeness of converted score distributions to actual score
distributions for randomly equivalent groups. The one-parameter
methods' results were judged inadequate for equating tests differing
in difficulty, possibly because of prevalent examinee guessing. The
three-parameter methods' results were promising, although two problems
were discussed which require further study. Presently, equipercentile
procedures may be the most viable for equating tests of differing
difficulty. (Author/CP)



Reproductions supplied by EDRS are the best that can be made
from the original document.






COMPARISON OF TRADITIONAL AND LATENT TRAIT THEORY METHODS FOR EQUATING TESTS



U.S. DEPARTMENT OF HEALTH,
EDUCATION & WELFARE
NATIONAL INSTITUTE OF
EDUCATION

THIS DOCUMENT HAS BEEN REPRODUCED
EXACTLY AS RECEIVED FROM
THE PERSON OR ORGANIZATION ORIGINATING
IT. POINTS OF VIEW OR OPINIONS
STATED DO NOT NECESSARILY REPRESENT
OFFICIAL NATIONAL INSTITUTE OF
EDUCATION POSITION OR POLICY.



Hofstra University



"PERMISSION TO REPRODUCE THIS 
MATERIAL HAS BEEN GRANTED BY 



TO THE EDUCATIONAL RESOURCES 
INFORMATION CENTER (ERIC)." 



Paper presented at the Annual Meeting of the American Educational
Research Association in Boston, April, 1980.



* The author greatly appreciates the assistance of Leonard S. Feldt
and Douglas R. Whitney in all phases of the study. Helpful suggestions
by Robert A. Forsyth, H. D. Hoover, William E. Coffman,
and an anonymous AERA reviewer are also appreciated. Any inaccuracies
are the sole responsibility of the author.



ABSTRACT



Results from equipercentile, linear, and latent trait equating of the
vocabulary and quantitative thinking tests of the Iowa Tests of Educational
Development were compared. Tests of similar and of differing difficulty were
equated. The data were item responses from a representative sample of 10,728
Iowa high school students. One-, two-, and three-parameter logistic latent trait
methods were used. The results from the equating methods were compared using
a cross-validation criterion which measured the closeness of converted score
distributions to actual score distributions for randomly equivalent groups.

The one-parameter methods' results were judged inadequate for equating
tests differing in difficulty, possibly because of prevalent examinee guessing.
The three-parameter methods' results were promising, although two problems were
discussed which require further study. Presently, equipercentile procedures may
be the most viable for equating tests of differing difficulty.



COMPARISON OF TRADITIONAL AND LATENT TRAIT THEORY METHODS FOR EQUATING TESTS



Achievement test batteries are typically published in several parallel forms
with different levels for different grades. The forms and levels of each test
making up the battery must be equated to one another. That is, every score on
a given form or level must be translatable into a score value on any other form
or level of that test.

Equipercentile and linear methods traditionally have been used to equate tests
(Angoff, 1971). Latent trait methods recently have been advocated as possible
improvements over the traditional methods (Lord, 1977; Wright, 1977). Lord (1977)
argued from theoretical considerations that traditional equating methods are not
appropriate for equating tests of differing difficulty, whereas latent trait theory
methods have the capacity to provide an appropriate equating in this case.

Lord's (1977) definition of equating implies that exact equating is possible
only when the tests to be equated measure the same unidimensional ability.
Achievement tests covering abilities encountered over a range of grades are
probably not unidimensional. However, the usefulness of a score scale will be
severely limited unless it spans all of the levels for which the battery is
intended. Thus, equating of levels must be attempted even when unidimensionality
does not hold.

The intent of this study was to compare the end results of two traditional
and seven latent trait theory equating schemes using data from the 1978 equating
project of the Iowa Tests of Educational Development (ITED). The study entailed
both the equating of forms (of similar difficulty) and the equating of levels
(of differing difficulty).






A cross-validation group was used to establish a criterion for comparing the
results of the equating methods. A cross-validation summary statistic was
calculated which was a measure of the closeness of converted score distributions
for stratified randomly equivalent groups. The goal of the study was to identify
the method or methods which were "best" according to this criterion and to examine
idiosyncrasies of the equating schemes for the equating of a two-level high school
achievement test battery.

Test Equating Definitions

Non-parallel tests X and Y (that is, tests measuring the same unidimensional
ability but differing in difficulty or reliability) can be considered to be
equated if any two examinees of equal true ability, one taking test X and the other
taking test Y, would be expected to obtain the same score when performance on test
X and test Y is expressed on a common score scale. This will be referred to as
the definition of equating for non-parallel tests.

According to Lord (1977, p. 128) tests X and Y can be considered to be
equated "... if and only if it is a matter of indifference to each examinee whether
he is to take test X or test Y". This definition implies that the definition of
equating for non-parallel tests holds. It also implies that for any population of
examinees with equal ability the distribution of observed scores on test X will
be identical to that of test Y. Hence, the standard error of measurement (as well
as the higher order moments) for any individual (or group of individuals of
identical ability) must be the same for test X as for test Y when the scores are
expressed on the common score scale. Lord (1977) explained that this definition
can be expected to hold only when test X and test Y are carefully constructed
parallel forms. Hence, the above definition will be referred to as the definition
of equating for parallel tests. Note that both of the equating definitions
require that the tests to be equated measure the same unidimensional ability.

For equipercentile or linear equating to be exact, the definition of equating
for parallel tests must hold. This is necessary because these methods require
that a common score scale be constructed such that identical expected frequency
distributions for the two tests will result for any subgroup of examinees.
Thus, conventional equipercentile or linear equating can be strictly used only
with parallel tests. In theory, the latent trait equating methods proposed by
Lord (1977) and Wright (1977) can be used to equate both parallel and non-parallel
tests under the definitions discussed here.

Review of Equating Research
Two types of studies have been carried out which assess the adequacy of
various equating methods. In the first type the adequacy of a single equating
scheme is assessed by examining the similarity of the results obtained
from disparate groups. The groups may differ in such characteristics as ability,
socio-economic status, or race. These studies are based on the principle that,
if the definition of equating for non-parallel tests holds, then equatings based
on diverse groups should be identical, apart from sampling error. In the second
type of study the end results of various methods have been compared to one another.
The present study is of this latter type.
Studies Using Different Groups

Linn (1975, p. 207) concluded that the equipercentile equating of elementary
school reading tests of similar difficulty in the Anchor Test Study (Loret, Seder,
Bianchini, and Vale, 1974) was "quite satisfactory for most practical purposes."
In a reanalysis, Slinde and Linn (1977) focused on equipercentile equating across
grades. They concluded that when tests differed substantially in difficulty,
equipercentile results were inadequate.



Several studies have examined the effects of using different sub-groups of
Rasch scaled items (e.g. odd-even, easy-difficult) on Rasch ability estimates
for differing groups of examinees. One set of studies (Curry, Bashaw, and Rentz,
1978; Tinsley and Dawis, 1975; Whitely and Dawis, 1974; and Wright, 1968) led to
the conclusion that the ability parameter is, in fact, invariant over item
sub-groups.

In another set of studies (Loyd and Hoover, 1979; Slinde and Linn, 1978; and
Slinde and Linn, 1979) Rasch-based equatings of tests of substantially different
difficulty were found to be highly dependent on the ability of the groups.
Gustafsson (1979b) and Slinde and Linn (1979) hypothesized that the effects of
guessing may have contributed to differences in equating results. Gustafsson
(1979a) showed that examinee guessing on a test may result in a negative correlation
between item difficulty and discrimination. Since Slinde and Linn (1979) found
evidence that such a negative correlation does occur, the effects of guessing may
have been a factor in their results.
Studies Comparing Methods

Lord (1977), Marco (1977), and Woods and Wiley (1977, 1978) compared some
conventional and latent trait theory equating methods. These studies indicate
that the equating schemes studied produce somewhat different results.

Rentz and Bashaw (1977) reanalyzed the Anchor Test data using Rasch equating
procedures and concluded that the Rasch and equipercentile equating results were
reasonably similar. However, Slinde and Linn (1977) pointed out that the
equipercentile method was not adequate for tests of differing difficulties. Thus,
it cannot be determined whether the Rasch procedures provided any additional
benefits over "inadequate" equipercentile methods.

¹ It should be noted that while Rasch model equating procedures (Wright, 1977)
were used in these studies, Rasch model test construction procedures (Wright, 1977)
were not.

Marco, Petersen, and Stewart (1979) compared a variety of equipercentile,
linear, and latent trait theory equating methods for equating the verbal portion
of the Scholastic Aptitude Test. When a test was equated to itself with an anchor
test of similar difficulty all but one of the methods appeared to be satisfactory.
The exception was one of the variations of the equipercentile method. The linear
equating procedures appeared to produce the most accurate equating in this situation.

When tests of different difficulty were equated, the latent trait methods
were superior and the linear methods clearly inferior. However, Marco et al.
(1979) noted that the criterion used for judging the superiority of equating
methods may have been biased against certain of the methods for equating tests of
differing difficulty. Hence, conclusions based on these results are very
tentative.

The studies reviewed here indicate that traditional and latent trait methods
can be expected to produce adequate equating results when parallel tests are
equated. Little empirical evidence exists for the superiority of any equating method
for tests of differing difficulty. It appears that linear equating is not a
sound procedure for such tests. Problems have also been found with equipercentile
and Rasch methods. If, as Slinde and Linn conclude, examinee guessing accounts for
the failure of the Rasch method, then the three-parameter logistic model should
provide a more suitable approach with tests of differing difficulty.






Equating Problem

The ITED

The seventh edition of the ITED includes separate tests in seven areas.
The tests are designed for administration to high school students. A description
of the tests and the philosophy underlying their construction is presented in the
ITED Manual for Administrators and Testing Directors (1972). Only two of the
ITED tests, vocabulary and quantitative thinking, were analyzed in the present
study.

The sixth edition of the ITED consists of one level administered in all
grades. The new seventh edition of the ITED has two levels, with one pair of
parallel forms (X-7 and Y-7) at each level. Level I of the seventh edition is
designed for administration to students in grades 9 and 10 and Level II for
administration in grades 11 and 12.

The sixth edition vocabulary and quantitative thinking tests contain 40 and 36
items and have time limits of 15 and 45 minutes, respectively. Level I and Level II
of the seventh edition forms each have the same number of items and time limits
as their sixth edition counterparts. One-third of the seventh edition items are
common to Level I and Level II. No items contained in the sixth edition are
included in the seventh edition.

In general, the Level I seventh edition tests are easier than their
sixth edition counterparts. Level II of each test is similar in difficulty to
the sixth edition version.






Equating Project for the Seventh Edition

The goal of the ITED equating project was to equate seventh edition
tests to those of the sixth edition. The study was based on the scores of
10,728 high school students from 34 Iowa schools. The schools chosen for
inclusion in the project represented the full range of averages exhibited
by Iowa schools, as inferred from their previous year's performance.

Within each 9th and 10th grade classroom included in the project, forms
X-6, X-7 Level I, and Y-7 Level I of the entire battery were administered to
random thirds of the students. Within each 11th and 12th grade classroom,
forms X-6, X-7 Level II, and Y-7 Level II of the whole battery were administered
to random thirds of the students. Because of the random assignment of forms
to students within each classroom, the three groups at each level can be
considered stratified random samples (stratified with respect to class
and school). Each pupil took only one form of the tests.

For the present study, students with missing scores, zero scores, or
perfect scores were eliminated because latent abilities of such students cannot
be estimated with latent trait estimation procedures. The number of 9th through
10th grade students included in the present study ranged from 1,883 taking Level
I of form Y-7 of the vocabulary test to 1,925 taking form X-6 of the vocabulary
test. Similarly, the numbers of 11th through 12th graders ranged from 1,579
taking Level II of form X-7 of the vocabulary test to 1,643 taking form X-6
of the quantitative thinking test. Every third student within each form and
test combination was withheld from the equating portion of the study. Their
scores were used as a cross-validation check for the equating. This aspect
of the study will be explained in a later section of this paper.






Equating Methods

One equipercentile, one linear, and seven latent trait theory equating
methods were compared. Angoff (1971) has provided a thorough discussion of
linear and equipercentile methods. Overviews of latent trait theory and latent
trait theory equating have been supplied by Baker (1977); Cook and Hambleton (1977);
Hambleton, Swaminathan, Cook and Eignor (1979); Kolen (1979); and Lord (1975, 1977).
Overviews of the Rasch model, which is one of the latent trait models, have been
provided by Wright (1977) and Wright and Stone (1979). The following discussion
assumes familiarity with at least some of these references.

The X-6 raw score scale was used as the common score scale. For those
equating methods requiring interpolation, linear interpolation was used as a
time-saving device. Identical procedures were followed for forms X and Y of
the vocabulary and quantitative thinking tests.
Equipercentile and Linear Methods

Method IA-1 described by Angoff (1971) was used for linear equating and
Method IA-2 for equipercentile equating. First, Level I of each seventh edition
test and form was equated to form X-6, using the combined data for grades 9 and 10.
Then, Level II of the seventh edition was equated to form X-6 using only the
11th and 12th grade data.
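
As an illustration of the two traditional methods for randomly equivalent
groups, the following minimal sketch may help. It is not the procedure from
Angoff (1971) in detail: the function names, the mid-percentile-rank
convention, and the use of linear interpolation between integer scores are
all assumptions made here for the example.

```python
import numpy as np

def linear_equate(x, scores_x, scores_y):
    """Map score x on form X to the form Y scale by matching means and SDs."""
    mx, sx = np.mean(scores_x), np.std(scores_x)
    my, sy = np.mean(scores_y), np.std(scores_y)
    return float(my + (sy / sx) * (x - mx))

def percentile_rank(x, scores):
    """Mid-percentile rank: percent below x plus half the percent exactly at x."""
    scores = np.asarray(scores)
    return 100.0 * (np.mean(scores < x) + 0.5 * np.mean(scores == x))

def equipercentile_equate(x, scores_x, scores_y, max_y):
    """Form Y score with the same percentile rank as x has on form X.

    Percentile ranks are tabulated at the integer form Y scores 0..max_y and
    linearly interpolated in between (an assumed convention).
    """
    target = percentile_rank(x, scores_x)
    grid = np.arange(0, max_y + 1)
    ranks = np.array([percentile_rank(g, scores_y) for g in grid])
    return float(np.interp(target, ranks, grid))

# Hypothetical data: form Y scores run two points higher than form X scores.
form_x = [1, 2, 3, 4, 5]
form_y = [3, 4, 5, 6, 7]
lin = linear_equate(3, form_x, form_y)
equi = equipercentile_equate(3, form_x, form_y, max_y=10)
```

With a pure shift between the two distributions, as in the hypothetical data,
both methods give the same conversion; they diverge when the shapes of the
two score distributions differ.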
Latent Trait Methods

One-, two-, and three-parameter logistic latent trait models were used.
Additionally, a modified one-parameter model was included, in which the common
slope of the item characteristic curves was allowed to differ from the sixth
to the seventh edition forms. Similar procedures were followed for each of the
latent trait models.
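
For readers less familiar with these models, the item characteristic curve
they share can be sketched as below. This is an illustrative sketch only: the
function name `icc` and the scaling constant D = 1.7 are assumptions, not
details taken from the study.

```python
import math

def icc(theta, b, a=1.0, c=0.0, D=1.7):
    """Probability of a correct response under the logistic model.

    b: item difficulty, a: discrimination (slope), c: lower asymptote.
    One-parameter model: a common to all items, c = 0.
    Two-parameter model: a varies by item, c = 0.
    Three-parameter model: a and c both vary by item.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

p_1pl = icc(0.0, 0.0)              # ability equal to difficulty, no guessing
p_3pl = icc(0.0, 0.0, c=0.2)       # same item with a lower asymptote of 0.2
p_low = icc(-10.0, 0.0, c=0.2)     # very low ability approaches the asymptote
```

The lower asymptote c is what lets the three-parameter model represent
guessing: a very low-ability examinee still answers correctly with
probability close to c rather than close to zero.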




The ability and item parameters were estimated using the Wood, Wingersky,
and Lord (1976) LOGIST computer program. Because one-third of the items were
common to the two levels, the parameters for Levels I and II of each seventh
edition test form were estimated using simultaneous procedures. The parameters
for the sixth edition tests were estimated using standard LOGIST procedures.

The item and latent ability parameters for the seventh edition tests were
then equated to the sixth edition scale. This was accomplished by using the
fact that the randomly equivalent groups taking forms X-7 and Y-7 would be
expected to have identical distributions of latent ability, apart from sampling
error. For the one-parameter model, the mean latent ability was used to equate
seventh edition ability and item parameter estimates to the sixth edition scale.
For the three remaining latent trait models, linear equating was completed,
using the mean and standard deviation of the latent ability estimates. Hence,
forms X-6, X-7, and Y-7 were on the same latent ability scale for each of the
four latent trait models.
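
The rescaling described above can be sketched as follows. The function names
are hypothetical, and the item parameter transformation shown is the standard
one that leaves logistic response probabilities unchanged under a linear
change of ability scale; the paper does not spell out the exact computations
used.

```python
import statistics

def linking_constants(theta_old, theta_new, match_sd=True):
    """Slope A and intercept B placing new-form abilities on the old scale.

    match_sd=False matches means only (the one-parameter case in the text);
    match_sd=True matches both mean and standard deviation.
    """
    A = 1.0
    if match_sd:
        A = statistics.pstdev(theta_old) / statistics.pstdev(theta_new)
    B = statistics.mean(theta_old) - A * statistics.mean(theta_new)
    return A, B

def rescale_item(a, b, c, A, B):
    """Transform item parameters so that a * (theta - b) is unchanged."""
    return a / A, A * b + B, c

# Hypothetical ability estimates for two randomly equivalent groups.
old_scale = [0.0, 1.0, 2.0]
new_scale = [1.0, 3.0, 5.0]
A, B = linking_constants(old_scale, new_scale)
rescaled_abilities = [A * t + B for t in new_scale]
rescaled_item = rescale_item(1.0, 3.0, 0.2, A, B)
```

After the transformation the rescaled abilities have the same mean and
standard deviation as the old-scale abilities, and the lower asymptote is
left untouched because it does not depend on the ability metric.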

Estimated true score equating. The estimated true score (Lord, 1977) of
an individual with a given estimated latent ability is equal to the sum, over
items, of the estimated probability of correctly answering each item. Using the
non-linear estimation procedure ZSBESPS (IMSL, 1978), edition seven estimated
true score equivalents of sixth edition integer scores were found. Similarly,
sixth edition estimated true score equivalents of seventh edition integer scores
were found. Each test, form, and level combination of the seventh edition was
equated to the corresponding sixth edition test using these procedures with the
four latent trait models.
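
A minimal sketch of estimated true score equating follows. It substitutes a
simple bisection search for the IMSL routine used in the study, and the item
parameters, function names, and search bounds are all hypothetical.

```python
import math

def p3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def true_score(theta, items):
    """Estimated true score: sum over items of P(correct | theta)."""
    return sum(p3pl(theta, a, b, c) for a, b, c in items)

def theta_for_true_score(tau, items, lo=-8.0, hi=8.0, tol=1e-8):
    """Ability whose true score equals tau, found by bisection.

    Relies on true_score increasing in theta (all a > 0). Scores at or below
    the sum of the lower asymptotes have no solution and raise ValueError.
    """
    if not true_score(lo, items) < tau < true_score(hi, items):
        raise ValueError("true score is outside the attainable range")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if true_score(mid, items) < tau:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def true_score_equivalent(x, items_from, items_to):
    """True score on the target form for the ability giving score x on the source form."""
    return true_score(theta_for_true_score(x, items_from), items_to)

# Hypothetical (a, b, c) item parameters already on a common ability scale.
harder = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.8, 1.0, 0.2)]
easier = [(a, b - 1.0, c) for a, b, c in harder]
same_form = true_score_equivalent(2.0, harder, harder)
on_easier = true_score_equivalent(2.0, harder, easier)
```

In this toy example a score of 2 on the harder form corresponds to a higher
score on the easier form, and a requested score below the sum of the lower
asymptotes (0.6 here) raises an error, which mirrors the "pseudo-chance"
limitation of the three-parameter true score method.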






Estimated observed score equating. Lord (1975, 1977) has shown that after the
latent trait parameters are estimated, an estimated observed distribution of raw
scores can be constructed using the generating formula for the generalized
binomial. Separate estimated observed score distributions were constructed for
forms X-7 Level I, X-7 Level II, Y-7 Level I, and Y-7 Level II as well as for
the 9th-10th and 11th-12th graders taking form X-6. Each form-level of edition
seven was then equated to the edition six raw score scale using equipercentile
equating of the appropriate estimated observed score distributions. This procedure
was followed for the modified one-parameter, two-parameter, and three-parameter
models.
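
The generalized binomial distribution mentioned above can be sketched with a
short recursion, which is equivalent to expanding the generating function
formed by multiplying (q_i + p_i t) over items. Averaging the conditional
distributions over the examinees' ability estimates is an assumption made
here about how the marginal distribution is assembled; the names and
parameter values are hypothetical.

```python
import math

def p3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def score_dist_given_probs(probs):
    """Number-correct distribution for independent items with the given
    success probabilities (generalized binomial), built one item at a time."""
    dist = [1.0]                      # before any item, the score is 0
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for x, f in enumerate(dist):
            new[x] += f * (1.0 - p)   # item answered incorrectly
            new[x + 1] += f * p       # item answered correctly
        dist = new
    return dist

def estimated_observed_dist(thetas, items):
    """Average the conditional score distributions over ability estimates."""
    total = [0.0] * (len(items) + 1)
    for theta in thetas:
        cond = score_dist_given_probs([p3pl(theta, a, b, c) for a, b, c in items])
        for x, f in enumerate(cond):
            total[x] += f / len(thetas)
    return total

two_coin_items = score_dist_given_probs([0.5, 0.5])
items = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.8, 1.0, 0.2)]
marginal = estimated_observed_dist([-1.0, 0.0, 1.0], items)
```

The resulting distributions for two forms could then be passed to an
equipercentile routine in place of the actual frequency distributions.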

Methods and Abbreviations

The nine equating methods and their abbreviations are:
1) Conventional equipercentile (EQUI);
2) Conventional linear (LIN);
3) Modified one-parameter estimated true score (TEQM1);
4) Modified one-parameter estimated observed score (ESTOBM1);
5) Two-parameter estimated true score (TSEQ2);
6) Two-parameter estimated observed score (ESTOBS2);
7) Three-parameter estimated true score (TSEQ3);
8) Three-parameter estimated observed score (ESTOBS3);
9) One-parameter estimated true score (TSEQ1).

Evaluation Procedures
No demonstrably superior criterion for judging the relative accuracy of
the various equating methods was available in this study. Therefore, the primary
evaluative technique was to estimate the stability of the results when applied
to a new, independent sample.






Two frequency distributions of raw scores on the sixth edition were
constructed for students in the cross-validation samples: one for 9th-10th
graders and another for 11th-12th grade students. Likewise, frequency distributions
for the cross-validation sample students taking forms X-7 and Y-7 were constructed.
Using the results from each equating scheme, the X-7 and Y-7 scores were converted
to the X-6 scale.

The cross-validation criterion was the mean (over examinees in the X-6
cross-validation distribution) squared difference between sixth edition integer
scores and seventh edition converted (equated) scores with identical percentile
ranks in randomly equivalent cross-validation distributions. Smaller values of
this index reflect greater consistency between the sixth edition and converted
seventh edition cross-validation distributions. For any particular test, form,
and level combination of the seventh edition, smaller values of the index were
interpreted as indicating more stable equating for that method.
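
One way the criterion just described might be computed is sketched below.
The mid-percentile-rank convention, the linear interpolation between
converted scores, and the function names are assumptions for illustration;
the paper does not give the computational details.

```python
import numpy as np

def mid_percentile_ranks(values, points):
    """Percent of values below each point plus half the percent equal to it."""
    values = np.sort(np.asarray(values, dtype=float))
    below = np.searchsorted(values, points, side="left")
    at_or_below = np.searchsorted(values, points, side="right")
    return 100.0 * (below + at_or_below) / (2.0 * len(values))

def cross_validation_index(x6_scores, converted7_scores):
    """Mean squared difference, over X-6 examinees, between each X-6 score and
    the converted seventh edition score holding the same percentile rank."""
    x6 = np.asarray(x6_scores, dtype=float)
    conv = np.asarray(converted7_scores, dtype=float)
    conv_points = np.unique(conv)
    conv_ranks = mid_percentile_ranks(conv, conv_points)
    x6_ranks = mid_percentile_ranks(x6, x6)
    matched = np.interp(x6_ranks, conv_ranks, conv_points)
    return float(np.mean((x6 - matched) ** 2))

# Hypothetical cross-validation samples.
x6_sample = [1, 2, 2, 3]
perfect = cross_validation_index(x6_sample, [1, 2, 2, 3])
off_by_one = cross_validation_index(x6_sample, [2, 3, 3, 4])
```

A conversion that reproduces the X-6 distribution exactly gives an index of
zero, and a conversion that is uniformly one point too high gives an index
of one, so the index is on a squared raw score metric.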

Estimated true scores below the "pseudo-chance" level of a test are
undefined for the three-parameter logistic model. In order to include the
three-parameter estimated true score method in the cross-validation, scores of one
on any pair of tests were arbitrarily considered to be equivalent; "missing"
equivalents below the "pseudo-chance" level were arrived at by linear interpolation.

Results

Cross-validation statistic values are shown in Table 1.
Insert Table 1 About Here



Based on the cross-validation statistic values, ranks were assigned to the methods
for each test, form, and level combination. Within each level, the four test
and form combinations were treated as randomly selected blocks and the nine
equating methods as treatments in the calculation of two Friedman statistics
(Conover, 1971). A Friedman statistic surpassing the appropriate critical value
indicates that overall, the methods differed in the cross-validation. The
Friedman statistic for Level I was 25.0 (p<.01) and for Level II 16.8 (p<.05).
Kendall's coefficient of concordance (Conover, 1971), a measure of the average
correlation among ranks, was 0.78 for Level I and 0.52 for Level II.
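
The two statistics reported above are directly related: with b blocks and k
treatments and no tied ranks, Kendall's coefficient equals the Friedman
statistic divided by b(k - 1). The sketch below, with hypothetical names and
made-up ranks, shows the textbook formulas; with b = 4 and k = 9 the reported
values of 25.0 and 16.8 correspond to coefficients of roughly 0.78 and 0.52.

```python
def friedman_and_kendall_w(ranks):
    """Friedman chi-square and Kendall's W from a blocks-by-treatments table
    of within-block ranks (assumes no ties)."""
    b = len(ranks)            # number of blocks (test and form combinations)
    k = len(ranks[0])         # number of treatments (equating methods)
    rank_sums = [sum(row[j] for row in ranks) for j in range(k)]
    chi2 = 12.0 / (b * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * b * (k + 1)
    w = chi2 / (b * (k - 1))
    return chi2, w

# Made-up example: four blocks that rank nine methods identically.
identical_ranks = [list(range(1, 10)) for _ in range(4)]
chi2_max, w_max = friedman_and_kendall_w(identical_ranks)

# Coefficients implied by the Friedman statistics reported in the text.
w_level_1 = 25.0 / (4 * (9 - 1))
w_level_2 = 16.8 / (4 * (9 - 1))
```

Perfect agreement among the four blocks gives the maximum Friedman value of
32 and a coefficient of 1, which puts the reported values in context.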

At Level I of the tests the three-parameter estimated observed score
distribution method appeared to produce the most accurate results. The
equipercentile method produced more accurate results, at least for the
quantitative thinking tests, than the remaining methods. The linear scheme
produced the least accurate results and the one-parameter true score equivalents
method the next least accurate equating results. The results from the other
methods appeared to be indistinguishable at Level I.

For Level II, the three-parameter estimated true score equivalents scheme
tended to produce the most accurate results. In all cases, the one-parameter
estimated true score equivalents method produced more accurate results than the
modified one-parameter estimated true score equivalents method. The results
from the other methods seemed to be indistinguishable.

Discussion 
Equipercentile vs. One-Parameter Methods

One notable finding was that the equipercentile method produced more
accurate cross-validation results than the one-parameter or modified one-parameter
true score equivalents methods for Level I of both tests. Level I was a downward
extension of the sixth edition test and, hence, was an easier test. Therefore,
a combination of examinee guessing and the equating of tests of differing difficulty
was present in the equating of Level I of the seventh edition to the edition
six score scale.

Note that the one-parameter true score equivalents equating procedures are
identical to Rasch equating procedures (Wright, 1977) and differ from Rasch model
equating only in the procedure used in test construction. As Gustafsson (1979a,
1979b) and Slinde and Linn (1979) have pointed out, if guessing is prevalent
with the Rasch model then item difficulty and discrimination could be expected
to be negatively correlated. For the two-parameter logistic model, the
correlations between item difficulty and discrimination parameter estimates for
total tests ranged from -0' .liSi? to -0.7081 (median = -.6813). The three-parameter
logistic model correlations ranged from 0.031^0 to 0.3670 (median = 0.1069).
Thus, the inclusion of the lower asymptote parameter resulted in minimal correlations.
These findings suggest that the failure of the one-parameter and modified
one-parameter schemes to take guessing into account may have reduced their
effectiveness at Level I of the tests. Since Level I of the seventh edition was of
substantially lesser difficulty than the sixth edition tests, these data are
consistent with the Gustafsson (1979a, 1979b) and Slinde and Linn (1979) conclusion
that the prevalence of examinee guessing may have an adverse effect on the equating
of tests of differing difficulty using the one-parameter methods.




Another notable result was that the modified one-parameter true score
equivalents (TEQM1) method produced more accurate cross-validation results for
Level I and less accurate results for Level II of both seventh edition tests
than did the one-parameter true score equivalents (TSEQ1) method. These
one-parameter and modified one-parameter schemes differ only in the manner in
which the overall (common) discrimination of the item characteristic curves
is handled. For the one-parameter method, the common item characteristic
curve discrimination for the seventh edition was forced to equal the common
discrimination of the sixth edition curves. For the modified one-parameter
scheme, while there was a common discrimination for all seventh edition
curves of a particular form, it was allowed to differ from the common
discrimination of the sixth edition curves.

Negative correlations were found between the difficulty and discrimination
parameter estimates of the two-parameter model. Hence, the average discrimination
of Level I items, when guessing was not taken into account by a lower asymptote
parameter, was greater than that of Level II. Comparing the cross-validation
findings from the one- and modified one-parameter true score equivalents
methods, it would appear that the items of Level II of the seventh edition had
item discrimination similar to those of the sixth edition. The items of Level
I probably had greater item discrimination than those of the sixth edition.
The correlation between item difficulty and discrimination was probably a
result of examinee guessing. Therefore, the differences between the
cross-validation results for the one- and modified one-parameter true score
equivalents may have resulted from differential effects of examinee guessing on
Level I and Level II of the seventh edition.






Three-Parameter Logistic Model

The three-parameter estimated observed score distribution method tended to
produce the most accurate cross-validation results at Level I of the tests but
results of moderate accuracy at Level II. The three-parameter estimated true
score equivalents method tended to produce the most accurate cross-validation
results at Level II but results of moderate accuracy at Level I.

No convincing explanation for the three-parameter logistic model results could
be found. However, two interesting facts may be noted. First, the three-parameter
estimated true score equivalents method does not provide estimated true scores
below the "pseudo-chance" level of the test, that is, below the sum of the lower
asymptote parameter estimates. (Interpolation was used to arrive at equated
scores below this level in the cross-validation analyses.) Thus, the estimated
true score scale is a condensed version of the observed score scale. This
condensed scale probably differs from the observed score scale near the
"pseudo-chance" level of the test and to a lesser extent along the entire score
scale. Possibly, similar condensing of score scales occurs for tests of similar
difficulty but differential condensing occurs for tests of unequal difficulty. If
so, this partially explains the finding that the three-parameter estimated true
score equivalents method produced the most accurate cross-validation results at
Level II and comparatively less accurate results at Level I.

Second, when the three-parameter model parameters are estimated, the LOGIST
program may be weak in accurately assessing the lower asymptote parameter. In
this case, the lower asymptote estimates of those items for which this difficulty
exists are fixed at a common value. Of the seventh edition lower asymptote
parameters estimated, $2% of the items on Level I only, 53% of the items common
to both levels, and 39% of the items on Level II only, were fixed at common
values. This possible failure for the items in Level I only probably had an
effect on the equating; the precise effect is not clear, however.
Linear Method

The results make it clear that the linear method is not satisfactory when
equating tests of unequal difficulty. The results for equating tests of similar
difficulty suggest that the relationship between scores on the sixth and seventh
edition tests is not linear throughout the entire range of scores.
Comments on Cross-validation

The cross-validation criterion was designed to be a measure of stability
over random sampling rather than a measure of accuracy of equating. The criterion
was developed to indicate which, of a number of equating methods, produced the
most consistent results in the equating of a set of pre-existing achievement
tests rather than tests designed specifically to fit any one of the equating
models. The exclusive use of comparisons, and the fact that the sampling
distribution of the cross-validation statistic is unknown, precludes definitive
statements about the consistency of the methods. Studies such as Slinde and Linn
(1977, 1978, 1979) provide evidence of accuracy in a more absolute sense. Both
comparative and absolute accuracy studies need to be completed.

Conclusion

The one-parameter models were found to produce inadequate results, perhaps
because of the prevalence of examinee guessing. Unless examinee guessing is
eliminated from test performance, possibly by using the Wright and Stone (1979)
procedures for discarding items showing a lack of fit to the Rasch model, the
present research and that of Slinde and Linn (1979) suggest that inadequate
results will occur when tests of differing difficulty are equated.

The three-parameter logistic model seems promising as a model for test
equating. However, questions about the effects of condensing the score scale
with the three-parameter estimated true score equivalents method and of the
possible inadequate estimation of the lower asymptote parameter still need to
be answered.
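
To make the idea of true score equivalents concrete, the sketch below shows one common way such a conversion can be computed under the three-parameter logistic model: find the ability at which the test characteristic curve of one test equals a given true score, then evaluate the other test's curve at that ability. This is offered as an illustration under stated assumptions, not as the procedure used in the study; the function names, the constant D = 1.7, the bisection search, and the item parameters are all hypothetical.

```python
import math

def p3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic item response function."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def true_score(theta, items):
    """Test characteristic curve: expected number-right score at theta."""
    return sum(p3pl(theta, a, b, c) for a, b, c in items)

def theta_for_true_score(tau, items, lo=-10.0, hi=10.0, tol=1e-10):
    """Ability whose true score is tau, found by bisection.

    The curve is increasing in theta when every discrimination is positive.
    """
    if not (true_score(lo, items) < tau < true_score(hi, items)):
        raise ValueError("true score outside the attainable range")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if true_score(mid, items) < tau:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def true_score_equate(tau_x, items_x, items_y):
    """True score on test Y equivalent to true score tau_x on test X."""
    return true_score(theta_for_true_score(tau_x, items_x), items_y)

# Hypothetical (a, b, c) parameters; test Y is the harder test.
items_x = [(1.0, -0.5, 0.20), (1.2, 0.0, 0.25), (0.9, 0.8, 0.15)]
items_y = [(1.1, 0.0, 0.20), (1.0, 0.6, 0.20), (1.3, 1.1, 0.25)]
equated = true_score_equate(2.0, items_x, items_y)
```

Under this model no ability level yields a true score at or below the sum of the lower asymptotes, so scores in that region have no true score equivalent and the sketch raises an error for them.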

The equipercentile method produced reasonably adequate results. This
method may presently be the most viable for equating tests that differ in
difficulty to the extent that they differed in the present study, even though
the equipercentile method could not be expected to produce a theoretically
"perfect" equating in this case.
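
The logic of the equipercentile method can be sketched briefly: a score on one test is converted to the score on the other test that has the same percentile rank. The version below is a minimal sketch that assumes integer number-right scores, mid-percentile ranks, and linear interpolation between adjacent scores; the function names and data are hypothetical, and any smoothing or other refinements used in the study are not represented.

```python
from bisect import bisect_left

def percentile_ranks(scores, max_score):
    """Mid-percentile rank (0 to 100) of each integer score 0..max_score."""
    n = len(scores)
    counts = [0] * (max_score + 1)
    for s in scores:
        counts[s] += 1
    ranks, below = [], 0
    for c in counts:
        ranks.append(100.0 * (below + 0.5 * c) / n)
        below += c
    return ranks

def equipercentile(x, x_scores, y_scores, max_x, max_y):
    """Score on test Y with the same percentile rank as score x on test X."""
    p = percentile_ranks(x_scores, max_x)[x]
    y_ranks = percentile_ranks(y_scores, max_y)  # nondecreasing
    if p <= y_ranks[0]:
        return 0.0
    if p >= y_ranks[-1]:
        return float(max_y)
    j = bisect_left(y_ranks, p)  # first Y score whose rank is >= p
    lo, hi = y_ranks[j - 1], y_ranks[j]
    # Interpolate linearly between Y scores j - 1 and j.
    return (j - 1) + (p - lo) / (hi - lo)

# Hypothetical samples: every Y score is one point higher than its X counterpart.
x_scores = [0, 1, 1, 2, 2, 2, 3, 3, 4]
y_scores = [s + 1 for s in x_scores]
converted = equipercentile(2, x_scores, y_scores, max_x=4, max_y=5)
```

Because the conversion is read from the two observed distributions, it can bend wherever those distributions differ in shape, which a single straight line cannot do.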



References

Angoff, W. H. Scales, norms, and equivalent scores. In R. L.
Thorndike (Ed.), Educational Measurement (2nd ed.).
Washington, D. C.: American Council on Education, 1971.

Baker, F. B. Advances in item analysis. Review of Educational
Research, 1977, 47, 151-178.

Cook, L. L. & Hambleton, R. K. Application of latent trait models to
the development of norm-referenced and criterion-referenced tests.
Paper presented to National Council on Measurement in Education,
Toronto, 1978.

Conover, W. J. Practical nonparametric statistics. New York: Wiley, 1971.

Curry, A. E., Bashaw, W. L., & Rentz, R. R. Invariance of Rasch model
ability parameter estimates over different collections of items.
Paper presented to American Educational Research Association,
Toronto, 1978.

Gustafsson, J.-E. Testing and obtaining fit to the Rasch model.
Paper presented to American Educational Research Association,
San Francisco, 1979a.

Gustafsson, J.-E. The Rasch model in vertical equating of tests: A critique
of Slinde and Linn. Journal of Educational Measurement, 1979b, 16, 153-158.

Hambleton, R. K. & Cook, L. L. Latent trait models and their use in
the analysis of educational test data. Journal of Educational
Measurement, 1977, 14, 75-96.

Hambleton, R. K., Swaminathan, H., Cook, L. L., Eignor, D. R., &
Gifford, J. A. Developments in latent trait theory: Models,
technical issues, and applications. Review of Educational Research,
1978, 48, 467-510.

Hieronymus, A. N. & Lindquist, E. F. Manual for administrators,
supervisors, and counselors. Forms 5 & 6, Iowa Tests of Basic Skills.
Iowa City, Ia.: Iowa Testing Programs, 1972.

IMSL Library 1. (Fortran IV) IBM S/370-360. 7th ed. Houston:
International Mathematical and Statistical Libraries, Inc., 1978.

ITED Manual for Administrators and Testing Directors. Forms X-6 &
Y-6. Iowa City, Ia.: Iowa Testing Programs, 1972.

Kolen, M. J. Comparisons of equipercentile, linear and selected latent trait
methods for equating forms and levels of the seventh edition of the
Iowa Tests of Educational Development. Unpublished Ph.D.
dissertation, University of Iowa, 1979.




Linn, R. L. Anchor test study: The long and short of it.
Journal of Educational Measurement, 1975, 12, 201-214.

Lord, F. M. A survey of equating methods based on item characteristic
theory. Research Bulletin 75-13. Princeton, N. J.: Educational
Testing Service, 1975.

Lord, F. M. Practical applications of item characteristic curve
theory. Journal of Educational Measurement, 1977, 14, 117-138.

Loret, P. G., Seder, A., Bianchini, J. C., & Vale, C. A. Anchor test
study final report: Project report and volumes 1 through 30.
Berkeley, Calif.: Educational Testing Service, 1974.

Loyd, B. H. & Hoover, H. D. A comparison of methods of vertical
equating. Paper presented to National Council on Measurement in
Education, San Francisco, 1979.

Marco, G. L. Item characteristic curve solutions to three intractable
testing problems. Journal of Educational Measurement, 1977, 14, 139-160.

Marco, G. L., Petersen, N. S., & Stewart, E. E. A test of the adequacy
of curvilinear score equating models. Paper presented to 1979
Computerized Adaptive Testing Conference, Minneapolis, 1979.

Rentz, R. R. & Bashaw, W. L. The national reference scale for reading:
An application of the Rasch model. Journal of Educational Measurement,
1977, 14, 161-180.

Slinde, J. A. & Linn, R. L. Vertically equated tests: Fact or phantom?
Journal of Educational Measurement, 1977, 14, 23-32.

Slinde, J. A. & Linn, R. L. An exploration of the adequacy of the Rasch
model for the problem of vertical equating. Journal of Educational
Measurement, 1978, 15, 23-35.

Slinde, J. A. & Linn, R. L. A note on vertical equating via the Rasch model
for groups of quite different ability and tests of quite different difficulty.
Journal of Educational Measurement, 1979, 16, 159-165.

Tinsley, H. E. & Dawis, R. V. An investigation of the Rasch simple logistic model:
Sample free item and test calibration. Educational and Psychological
Measurement, 1975, 35, 325-339.

Whitely, S. E. & Dawis, R. V. The nature of objectivity with the Rasch model.
Journal of Educational Measurement, 1974, 11, 163-178.




Wood, R. L., Wingersky, M. S., & Lord, F. M. LOGIST: A computer program
for estimating examinee ability and item characteristic curve parameters.
Research Memorandum 76-6. Princeton, N. J.: Educational Testing
Service, 1976.

Woods, E. M. & Wiley, D. E. An application of item characteristic curve
equating to single-form tests. Paper presented to Psychometric
Society, Chapel Hill, N.C., 1977.

Woods, E. M. & Wiley, D. E. An application of item characteristic curve
equating to item sampling packages on multi-form tests. Paper presented
to American Educational Research Association, Toronto, Canada, 1978.

Wright, B. D. Sample-free test calibration and person measurement. In Proceedings
of the 1967 Invitational Conference on Testing Problems. Princeton, N.J.:
Educational Testing Service, 1968.

Wright, B. D. Solving measurement problems with the Rasch model. Journal
of Educational Measurement, 1977, 14, 97-116.

Wright, B. D. & Stone, M. H. Best test design: A handbook for Rasch measurement.
Chicago: MESA, 1979.






Table 1

Cross-Validation Statistic Values

                          EQUI   LIN    TSEQ1  [?]1   TSEQ2  [?]2   TSEQ3  [?]    [?]1

Vocabulary             X  1.60   4.30   1.84   [?]    1.23   1.07   0.55   [?]    2.15
                       Y  0.30   3.04   0.79   0.86   0.24   0.57   1.03   0.15   1.96

Quantitative           X  0.27   2.98   0.74   0.66   0.74   0.62   0.63   0.40   2.45
Thinking               Y  0.30   1.88   0.36   0.36   0.33   0.36   0.81   [?]    1.32


Vocabulary             X  1.43   0.94   2.09   1.63   1.69   1.37   0.86   1.73   1.57
                       Y  4.09   2.40   3.42   2.55   2.90   2.40   1.78   3.22   3.11

Quantitative           X  0.45   1.32   1.04   0.75   0.96   0.71   0.27   0.82   0.60
Thinking               Y  1.88   1.78   1.81   2.31   1.33   1.89   1.28   1.79   0.54






